Pipelined processing of RDMA-type network transactions

ABSTRACT

A computer system such as a server pipelines RNIC interface (RI) management/control operations such as memory registration operations to hide from network applications the latency in performing RDMA work requests caused in part by delays in processing the memory registration operations and the time required to execute the registration operations themselves. A separate QP-like structure, called a control QP (CQP), interfaces with a control processor (CP) to form a control path pipeline, separate from the transaction pipeline, which is designated to handle all control path traffic associated with the processing of RI control operations. This includes memory registration operations (MR OPs), as well as the creation and destruction of traditional QPs for processing RDMA transactions. Once the MR OP has been queued in the control path pipeline of the adapter, a pending bit is set which is associated with the MR OP. Processing of an RDMA work request in the transaction pipeline that has engendered the enqueued MR OP is permitted to proceed as if the processing of the MR OP has already been completed. If the work request gets ahead of the MR OP, the associated pending bit being set will notify the adapter's work request transaction pipeline to stall (and possibly reschedule) completion of the work request until the processing of the MR OP for that memory region is complete. When the memory registration process for the memory region is complete, the associated pending bit is reset and the adapter transaction pipeline is permitted to continue processing the work request using the newly registered memory region.

BACKGROUND OF THE INVENTION

The present invention pertains to the field of computer architecture and more specifically to the efficient processing of RNIC interface (RI) management control operations (e.g. memory registration) required by RDMA (Remote Direct Memory Access) type work requests issued by an RNIC interface (RI) running on computer systems such as servers.

DESCRIPTION OF THE RELATED ART

In complex computer systems, particularly those in large transaction processing environments, a group of servers is often clustered together over a network fabric that is optimized for sharing large blocks of data between the servers in the cluster. In such clustering fabrics, the data is transferred over the fabric directly between buffers resident in the host memories of the communicating servers, rather than being copied and packetized first by the operating system (OS) of the sending server and then being de-packetized and copied to memory by the OS of the receiving server in the cluster. This saves significant computing resources in the transacting servers in the form of OS overhead that may be applied to other tasks. This technique for establishing connections that bypass the traditional protocol stack resident in the OS of transacting servers and instead transacting data directly between specified buffers in the user memory of the transacting servers is sometimes generally referred to as remote direct memory access or RDMA.

Different standards have been established defining the manner and the protocols by which direct memory connections between servers are securely established and taken down, as well as the manner in which data is transferred over those connections. For example, Infiniband is a clustering standard that is typically deployed as a fabric that is separate and distinct from fabrics handling other types of transactions between the servers and devices such as user computers or high-performance storage devices. Another such standard is the iWARP standard that was developed by the RDMA Consortium to combine RDMA type transactions with packet transactions using TCP/IP over Ethernet. Copies of the specifications defining the iWARP standard may be obtained at the Consortium's web site at www.rdmaconsortium.org. The iWARP specifications and other documents available from the RDMA Consortium web site are incorporated herein in their entirety by this reference. These and other RDMA standards, while differing significantly in their transaction formats, are typically predicated on a common paradigm called a queue pair (QP). The QP is the primary mechanism for communicating information about where data is located that should be sent or received using one of the standard RDMA network data transfer operations.

A QP is typically made up of a send queue (SQ) and a receive queue (RQ), and can also be associated with at least one completion queue (CQ). QPs are created when an application running on a local server issues a request to an RNIC interface (RI) that a memory transaction be processed that directly accesses host memory in the local server and possibly host memory in a remote server. The QPs are the mechanism by which work request operations associated with the processing of the transaction request made by the application are actually queued up, tracked and processed by the RNIC adapter.

The memory region(s) specified in a direct memory transaction are logically (although not typically physically) contiguous. Thus, the RI also coordinates retrieving a virtual to physical translation for the pages of physical memory actually used by a memory region and programs the RNIC adapter with this information so that the RNIC may directly access the actual physical locations in host memory that make up the memory region as if they were physically contiguous. Access privileges are also retrieved for that memory region and stored in the RNIC with the address translation information. This RI management process is known as memory registration. Most RI management processes, including memory registration, are presumed by the RDMA standards to be a synchronous process such that they will complete before any associated work request is processed by the RNIC on behalf of the application. Thus, a management process such as memory registration blocks the processing of any associated work request by the RNIC until it is complete.

Because memory registration operations (MR OPs) must access many of the same resources in the adapter that are also processing the execution of previously enqueued work requests, because they can be large in number, and because they can be quite time consuming to perform when the virtual to physical translations lead to many physical addresses which all must be transferred to and stored within the RNIC, the completion of memory registration operations may be significantly delayed. This forces the adapter to block further processing of work requests associated with the MR OPs for the entire length of the delay. These factors can significantly increase the overall transaction latency from the perspective of the application, and thus decrease throughput of the fabric in general. This may not be tolerable for many applications.

Therefore, it would be desirable to decrease the latency of RDMA type transactions (and thereby increase network throughput) between servers caused by the blocking of RNIC work requests while they await completion of requisite RI management transactions such as memory registration operations. It would be further desirable to achieve this reduced latency/increased throughput while maintaining compatibility with the specifications of RDMA protocols that require serial completion of memory registration operations prior to performing RDMA memory operations from and to those regions.

BRIEF DESCRIPTION OF DRAWINGS

FIG. 1 is a block diagram of a computer system including clustering, user access and storage area networking according to the prior art.

FIG. 2 is a block diagram of a computer system including a common clustering and user access fabric and a storage area networking fabric according to the prior art.

FIG. 3A is a process flow diagram describing execution of a QP operation performed by the computer system of FIGS. 1 and 2 in accordance with the prior art.

FIG. 3B is a process flow diagram describing a serial memory registration process called by the process of FIG. 3A in accordance with the prior art.

FIG. 4 is a block diagram of a computer system including RDMA capable adapters that includes various features and embodiments of the present invention.

FIG. 5A is a process flow diagram describing a pipelined memory registration process performed by the computer system of FIG. 4 and in accordance with embodiments of the present invention.

FIG. 5B is a process flow diagram describing execution of a posted memory registration operation performed by the computer system of FIG. 4 and in accordance with embodiments of the present invention.

FIG. 5C is a process flow diagram describing execution of a posted QP operation performed by the computer system of FIG. 4 and in accordance with embodiments of the present invention.

FIGS. 6A and 6B are a logical block diagram of an embodiment of the server of FIG. 4.

FIG. 7 is a parallel sequence diagram describing one scenario in the execution of a Local Send Operation in accordance with an embodiment of the invention.

FIG. 8 is a parallel sequence diagram describing one scenario in the execution of a Remote RDMA Write Operation in accordance with an embodiment of the invention.

FIG. 9 is a block diagram of a computer system combining the fabrics for clustering, user access and storage area networking into one Ethernet fabric according to the present invention.

FIG. 10 is a logical block diagram of an embodiment of the protocol engine of the server of FIGS. 6A and 6B.

DETAILED DESCRIPTION OF THE DRAWINGS

Processing of RDMA type network transactions between servers over a network typically requires that the memory regions comprising the source and target buffers for such transactions be pre-registered with their respective RDMA capable adapters through which the direct data placement transactions will be conducted. The memory registration process provides each adapter with a virtual to physical address translation for the pages of physical memory that make up the contiguous virtual memory region being specified in the RDMA operation, as well as the access privilege information associated with the memory region. Specifications for RDMA standard protocols, such as iWARP, require that this memory registration process be complete before the work request generated in response to the RDMA transaction specifying the memory region may be processed.
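The role of the registered translation can be sketched as follows. This is only an illustrative model, not the adapter's actual data layout: the 4096-byte page size, the function name translate, the base frame number P=100, and the interpretation of P+8, P+6, P+1 and P+4 (the values the description later uses for memory region N) as page frame numbers are all assumptions made for the example.

```python
PAGE_SIZE = 4096  # assumed page size in bytes; the source does not state one


def translate(pbl, offset):
    """Map a byte offset within a registered region to a physical address.

    pbl is a hypothetical physical buffer list: one physical page frame
    number per virtual page of the region, in region order.
    """
    page_index, within_page = divmod(offset, PAGE_SIZE)
    if not 0 <= page_index < len(pbl):
        raise IndexError("offset lies outside the registered region")
    return pbl[page_index] * PAGE_SIZE + within_page


# Four virtual pages backed by the non-contiguous frames P+8, P+6, P+1, P+4.
P = 100  # arbitrary base frame number chosen for illustration
pbl_n = [P + 8, P + 6, P + 1, P + 4]
first_byte_of_third_page = translate(pbl_n, 2 * PAGE_SIZE)
```

The region looks contiguous to its user (offsets 0 through 4 * PAGE_SIZE - 1), while each access lands in whichever physical page the list names for that offset.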

Embodiments of the present invention are disclosed herein that provide two separate pipelines. One is the traditional transmit and receive transaction pipeline used to process RDMA work requests, and the other is a management/control pipeline that is designated to handle RI control operations such as the memory registration process. Embodiments of the invention employ a separate QP-like structure, called a control QP (CQP), which interfaces with a control processor (CP) to form the pipeline designated to handle all control path traffic associated with the processing of work requests, including memory registration operations (MR OPs) and the creation and destruction of QPs used for posting and tracking RDMA transactions requested by applications running on the system.

In processing an RDMA memory transaction request from an application in accordance with embodiments of the invention, an RDMA verb is called that identifies the requisite RI management processes that must be executed to program the adapter (i.e. RNIC) in support of that memory transaction. Among these is typically a memory registration operation (MR OP) that is enqueued in a CQP of the adapter. Once the MR OP has been queued in the control path pipeline of the adapter to register the memory region specified by the memory transaction, a pending bit is set for that memory region and the call to the RDMA verb is returned. The RDMA transaction is posted to the appropriate QP and the RI generates a work request for the adapter specifying access to the memory region being registered by the pending MR OP. This work request is enqueued in the transaction pipeline of the adapter.

The processing of the work request is permitted to proceed as if the processing of the associated MR OP has already been completed. If the work request gets ahead of the MR OP, the pending bit associated with the memory region being registered will notify the adapter's work request transaction pipeline to stall (and possibly reschedule) completion of the work request until the processing of the MR OP for that memory region is complete. When the memory registration process for the memory region is complete, the pending bit for that memory region is reset and the adapter transaction pipeline is permitted to continue processing the work request using the newly registered memory region. Whenever the MR OP completes prior to the adapter transaction pipeline attempting to complete the QP work request, no transaction processing is stalled and the latency inherent in what has been traditionally performed as a serial process is completely hidden from the application requesting the RDMA memory transaction. This serves to lower the overall latency as well as increase the throughput of the network commensurately with the number and size of pending memory registration operations. At the same time, the memory registration process is guaranteed to complete before the work request is completed, thus maintaining compatibility with the RDMA specification.

FIG. 1 illustrates a group of servers 100 that are clustered together over a network fabric such as Infiniband fabric 102, a system configuration known to those of skill in the art. To further improve performance, these clustered servers 100 can be provided with high-performance access to mass storage, illustrated as storage units 106, over a separate storage area network (SAN) such as Fibre Channel fabric 104. The Fibre Channel architecture of the SAN 104 is optimized for mass storage access type transactions. User computers or other user devices 110 can be provided access to the cluster of servers 100 through yet another distinct network, such as an Ethernet network 108.

FIG. 2 illustrates yet another system configuration known to those of skill in the art. In this example, the network transactions between the clustered servers 200 are performed over the Ethernet fabric 108, along with those packetized transactions typically transacted over the Ethernet fabric 108 between the servers 200 and user devices 110. Each of the clustered servers 200 of FIG. 2 includes RDMA capable adapter cards (not shown) that can coordinate RDMA type memory transactions over the Ethernet fabric 108 as well as adapters that can handle standard Ethernet packet transactions over TCP/IP. The operation of the host processors and the RDMA adapters of the servers 200 complies with the iWARP specification as developed by the RDMA Consortium as previously discussed and which has been incorporated herein by reference.

Common to both clustering implementations of FIGS. 1 and 2 is the requirement that regions of host memory that are to provide source and target buffers for RDMA memory transactions between transacting applications running on the servers be registered to their respective RDMA adapters. The procedural flow diagrams of FIGS. 3A and 3B provide a high-level description of the memory registration process as is currently known to those of skill in the art. At step 300, the host processor issues an RDMA type memory transaction request at the behest of some application running on the host processor. This RDMA request specifies the use of memory region x within its host memory as one of the buffers to be used in the transaction. At step 305, it is first ascertained if memory region x is already registered. If YES, processing of the RDMA request continues at 320. If NO, processing continues at 310 where a call to the appropriate RDMA verb for the memory registration process is made for memory region x. Processing of the RDMA memory transaction request is then blocked by the RI executing the memory registration process until the memory registration process call is returned as complete at 315. Processing of the RDMA memory transaction cannot be continued at 320 until the memory registration process is complete and the call is returned. Once the call is returned, the RI generates a work request for the transaction and this is enqueued in the appropriate queue (i.e. the SQ or the RQ) of the appropriate QP. Processing of the work request is then taken up by the adapter's transaction pipeline as resources permit.

FIG. 3B illustrates a high-level procedural flow of the memory registration process 310 called by the process of FIG. 3A known to those of skill in the art. At 350, it is determined if the adapter resources (e.g. adapter memory resources) are available to perform the process and if NO, an error message is generated at 380 and processing returns at 385. Otherwise, processing continues at 355 where the host allocates a buffer in host memory to establish a physical page list for the memory region to be later provided to the adapter. At 360, the host requests the list of pages that make up memory region x and has them pinned by the operating system. Pinning the pages ensures that the host doesn't change the virtual to physical translation until the memory region is no longer registered. The physical addresses for each of the pages are then returned by the operating system at 365. At 370, the host stores the physical addresses within its page list. The adapter then receives and stores access rights along with the complete physical page list for memory region x at 375. Processing then returns to the calling application at 385 (for example, step 310, FIG. 3A). Thus, it can be seen from the foregoing procedural flows of the prior art that the memory registration process completely gates the processing of RDMA type memory transactions and thus any delays in processing memory registration operations can significantly increase the latency (and therefore decrease throughput) of the network in processing network level RDMA type transactions.
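The serial flow of FIG. 3B can be modeled roughly as shown below. The dictionaries standing in for the host and the adapter, the function name register_serial, the exception class, and the free_entries counter are hypothetical constructs introduced for illustration; only the step numbers in the comments come from the description.

```python
class AdapterResourceError(Exception):
    """Stands in for the error message generated at step 380."""


def register_serial(host, adapter, region_pages, access_rights):
    # Step 350: check that the adapter has room before doing any work.
    if adapter["free_entries"] < len(region_pages):
        raise AdapterResourceError("insufficient adapter memory")
    # Step 355: allocate a host buffer for the physical page list.
    page_list = []
    # Steps 360-370: pin each page, then record its physical address.
    for vpage in region_pages:
        host["pinned"].add(vpage)
        page_list.append(host["page_table"][vpage])
    # Step 375: the adapter stores the access rights and the full page list.
    adapter["free_entries"] -= len(page_list)
    adapter["regions"].append({"pages": page_list, "access": access_rights})
    # Step 385: the call returns only now, so the caller was blocked throughout.
    return page_list
```

Because every step runs inside the one call, the RDMA request that triggered it cannot advance until the last page address has been stored, which is the gating behavior the paragraph above describes.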

FIG. 4 illustrates a high-level block diagram of an embodiment of a server of the present invention. The server 600 includes one or more host processors 610 and host memory 620. Server 600 also includes a memory management unit (MMU, not shown) that permits the processor 610 to access the memory 620 as a contiguous virtual memory while performing a virtual to physical address translation to facilitate access to the non-contiguous memory locations physically occupied by the memory regions in the host memory 620. The server also includes a server chipset 615 that interfaces through a known bus interface such as a PCI bus with one or more RDMA compatible adapters 650 capable of various throughput rates over the network (e.g. 1 Gigabit 650a and 10 Gigabit 650b). Each adapter includes adapter processing circuitry 652 and local adapter memory 654. Finally, physical interfaces 656a, 656b to the network are also provided. It should be noted that the RDMA compatible adapters 650 may be compatible with any known RDMA standard that requires or may benefit from the pre-registration of memory regions with network adapters, including but not limited to Infiniband and iWARP.

FIGS. 5A and 5B illustrate high-level procedural flow diagrams in accordance with embodiments of the present invention for which the memory registration process has been pipelined with the processing of network level RDMA requests. It should be noted that the process of executing the RDMA request that is necessitating the memory registration appears virtually the same as that of FIG. 3A. This is because the pipelining of the memory registration process in accordance with embodiments of the present invention is transparent to the host processor of the server and therefore to those processes performed by the host processor. The difference is that when the RDMA verb for the memory registration process of memory region x is called, processing begins at step 400 of FIG. 5A. A description of the procedural flow is now presented with additional reference to FIGS. 6A and 6B, which together illustrate a more detailed block diagram of the server 600 of FIG. 4.

An application running on host processor (610, FIG. 6A) of server 600 first initiates an RDMA-type memory transaction that results in a memory registration verb call appropriate to the particular RDMA type memory transaction. Those of skill in the art will recognize that these verb calls are standard or protocol specific and are defined in the specification developed for the particular RDMA standard employed. The present invention is intended to operate with all such standards that require memory registration and/or virtual to physical translation of page lists for physical buffer locations.

Provided that the adapter resources (e.g. sufficient adapter memory 654, FIG. 6A in which to store the requisite translated physical page address information) are available as determined at 405, processing continues at steps 450, 455 and 460, where the physical addresses for the individual pages contained in the page list 946, FIG. 6A are pinned and imported into the host memory. For example, if x=N, then the memory region x that is to be registered corresponds to the contiguous virtual memory region N 915, FIG. 6A which physically translates to physical memory pages in the host memory 620 having physical addresses P+8, P+6, P+1, and P+4. At 410 the host processor (610, FIG. 6A) enqueues a register write to an MRTE Management register on register write queue (952, FIG. 6B). The MRT write interfaces through the local memory interface (LMI) (972, FIG. 6B) with memory registration table (MRT) 980 located in adapter memory 654. This register write is communicated through the PCI host interface of the Server Chip Set 615, FIG. 6A and causes the Protocol Engine 901, FIG. 6B to set a pending bit in the appropriate MRT entry (MRTE) associated with memory region x (for example, MRTE N of MRT 980, FIG. 6B). Processing then continues at 415, FIG. 5A where the host processor (610, FIG. 6A) posts a memory registration operation (MR OP) onto the send queue (SQ) 940, FIG. 6A of the control queue pair (CQP) 939 being maintained in the host memory 620, FIG. 6A of the server. The MR OP points to a pinned host page list 946 for the memory region being registered. The host processor (610, FIG. 6A) then enqueues on queue 952, FIG. 6B a register write to the work queue entry (WQE) allocate register to let the CP 964, FIG. 6B know that it has work pending in the CQP 939, FIG. 6A. Processing continues by returning to the calling application at 430, FIG. 5A.
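The host-side half of the pipelined registration (steps 410, 415 and 430) can be sketched as below. The class and field names (Mrte, Adapter, cqp_sq, doorbells) and the function register_verb are hypothetical and chosen for readability; the real register formats and queue entry layouts are not given in the text.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Mrte:
    pending: bool = False                        # pending bit for the region
    pbl: list = field(default_factory=list)      # physical buffer list, filled later


@dataclass
class Adapter:
    mrt: dict = field(default_factory=dict)      # memory registration table
    cqp_sq: deque = field(default_factory=deque) # send queue of the control QP
    doorbells: int = 0                           # WQE allocate register writes seen


def register_verb(adapter, region_id, host_page_list):
    # Step 410: the MRTE Management register write sets the pending bit.
    adapter.mrt[region_id] = Mrte(pending=True)
    # Step 415: post an MR OP that points at the pinned host page list.
    adapter.cqp_sq.append(("MR_OP", region_id, host_page_list))
    # WQE allocate register write: tell the adapter that control work is queued.
    adapter.doorbells += 1
    # Step 430: return at once; nothing has been copied to the adapter yet.
    return region_id
```

The point of the sketch is what the call does not do: it never touches the physical buffer list, so its cost is independent of the number of pages in the region.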

Thus, the RI is now free to continue processing the RDMA type memory request operation to this memory region x even though the actual registration process may not yet have begun. The RI is now free to post a work request on the appropriate QP to initiate the processing of the transaction. This also involves a write to the WQE allocate register, which informs the CUWS 954, FIG. 6B of the work request that must be processed.

Once received over the SQbus and scheduled for execution by the context update and work scheduler (CUWS) 954, FIG. 6B the CP 964 begins execution of the MR OP that was posted in the SQ 940, FIG. 6A of the CQP 939, thereby performing the memory registration process beginning at 420, FIG. 5B. At 463 the adapter 652, FIG. 6B, more specifically the CP 964, FIG. 6B, sets up an MRTE, such as MRTE 981, FIG. 6B, and pulls the page list, such as page list 946, FIG. 6A which contains the entries for memory region N 915, FIG. 6A, into a physical buffer list (PBL) 978, FIG. 6B. Once this process is complete, processing continues at 465 where the pending bit in MRTE N (memory region x=N) is reset by the MRTE update process to indicate completion of the MR OP for that memory region. At this point, a completion entry may be sent from the MRTE update process to the CQ 944, FIG. 6A of the CQP 939 to indicate that the entry containing the completed MR OP may now be reused for another control operation.

Thus, the memory registration process and the associated work request are able to proceed in parallel and independent of one another. The CP 964 is free to process management control operations (including the MR OPs) posted to the CQP 939 and the transaction pipeline (including the transmit (TX) 966 and receive (RX) 968 pipelines) proceeds with processing the QP work request (500, FIG. 5C) independently. Thus, in embodiments of the invention, the transaction processing for the QP work request referencing memory region x=N may begin independently of the completion of the MR OP for memory region x=N. If the adapter transaction pipeline gets ahead in processing of the QP work request prior to completion of the MR OP, the transaction pipeline will recognize that the pending bit is still set (502, FIG. 5C) and will either stall the execution of the work request or it will re-schedule the work request and resume processing it once the pending bit for that memory region has been reset (504, FIG. 5C). In the case of the reverse, the MR OP completes first and thus the physical page list is available when the QP work request is being executed. The adapter transaction pipeline is then able to access the appropriate physical memory locations to sink or source data in completing the QP OP by acquiring the physical page addresses from the PBL associated with the memory region specified by the QP work transaction.
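The two possible orderings can be illustrated with a small model of the adapter side. The dictionary layout of the MRT and the names cp_process_mr_op and pipeline_try are assumptions made for this sketch; the step numbers in the comments are those of FIGS. 5B and 5C.

```python
def cp_process_mr_op(mrt, region_id, host_page_list):
    """Control processor side: complete a posted MR OP."""
    entry = mrt[region_id]
    entry["pbl"] = list(host_page_list)  # step 463: pull the page list into the PBL
    entry["pending"] = False             # step 465: reset the pending bit


def pipeline_try(mrt, work_request):
    """Transaction pipeline side: attempt one QP work request."""
    entry = mrt[work_request["region"]]
    if entry["pending"]:
        # Step 502: registration not finished, so stall or reschedule.
        return ("rescheduled", None)
    # Step 504: the PBL is valid and can be used to sink or source data.
    return ("completed", entry["pbl"])


# MRT as the host left it after the registration verb returned.
mrt = {"N": {"pending": True, "pbl": []}}
wr = {"region": "N"}

first_attempt = pipeline_try(mrt, wr)            # work request wins the race
cp_process_mr_op(mrt, "N", [108, 106, 101, 104])
second_attempt = pipeline_try(mrt, wr)           # retried after the bit is reset
```

If cp_process_mr_op runs before the first call to pipeline_try, the first attempt completes directly and no stall is observed, which corresponds to the case where the registration latency is fully hidden.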

Specific examples of the pipelined execution of work requests in parallel with the memory registration operations in accordance with embodiments of the invention are illustrated in FIGS. 7 and 8. In FIG. 7, a sequence diagram is provided for illustrating one possible outcome scenario in the processing of an RDMA type memory transaction called a “local SEND operation.” Each column of the figure represents one of the processing nodes that may be involved in the overall processing of the transaction, including the local application node (includes a local application program running on the host processor 610, FIG. 6A of a local server), the adapter control processor (CP) 964, FIG. 6B running on the adapter 652, FIG. 6B of the local server that executes control operations including memory registration operations, the local adapter transmit 966, FIG. 6B/receive 968, FIG. 6B transaction pipeline that is executed on the adapter 652, FIG. 6B for handling work requests associated with RDMA-type memory request transactions, and finally the application running on the remote server of the cluster (not shown) with which the local server is or may be communicating over the network. Each row of the diagram indicates a non-specific time frame which is sequentially later in time than the row above it.

The example of FIG. 7 illustrates the pipelined processing of a local SEND operation, which does not actually involve the movement of data to another server on the network. It is nevertheless an RDMA transaction that requires that the memory region comprising the source buffer for the data first be registered with the local adapter. Thus, as indicated in the first time point of the sequence (Row 1), the first step for the RI running on the local server host processor 610, FIG. 6A in accordance with embodiments of the invention is to call the memory registration verb for the memory region specified by the SEND operation. The calling of this process includes enqueueing of an MRTE Management register write on register write queue 952, FIG. 6B (as previously described) that sets the pending bit for the MRTE in the MRT 980, FIG. 6B corresponding to the memory region that needs to be registered. The MR OP associated with the memory registration necessitated by the local SEND OP is then posted to the SQ of the local adapter's CQP. The MR OP entry in the SQ includes all of the relevant information for the memory region, including a page list pointer and access privileges for the region, etc. Finally, a WQE allocate register write is posted to the register write queue 952, FIG. 6B by the local server host processor 610, FIG. 6A to ring the doorbell of the CUWS 954, FIG. 6B to let it know that a control op has been posted for it to schedule and process.

As previously mentioned, a call to the registration verb is returned after the foregoing steps have been performed, notwithstanding that the MR OP has not yet been processed. As shown in Row 2, this permits the RI running on the local server host processor 610, FIG. 6A to post its local SEND OP on one of its QPs (e.g. QP_(N)). This process also includes a WQE Allocate register write that again rings the doorbell to let the adapter's work scheduler 954, FIG. 6B know that a transmit pipeline operation needs to be processed. Simultaneously with, before or after the foregoing steps, the local adapter CP 964, FIG. 6B begins processing the MR OP posted to the SQ of the CQP previously and in accordance with the procedure flow diagram of FIG. 5B. As indicated, this includes setting up the MRTE in the adapter memory for the memory region x, pulling the page list from the host memory and obtaining the physical translations for the page addresses and storing them in a physical buffer list (PBL) in local adapter memory 654.

As indicated in Row 3 of FIG. 7, at some point subsequent to the posting of the QP SEND op, the local transmit pipeline begins processing the posted SEND op. If the processing of the MR OP has not completed (or even started for that matter), then the pending bit will be set and the processing of the SEND op is suspended until the pending bit is reset, indicating that registration for the memory region accessed by the SEND op has been completed. In Row 4 of the parallel sequence of FIG. 7, the MR OP processing is completed and the pending bit is subsequently cleared as a result. A completion indicator may be sent to the completion queue CQ 944, FIG. 6A of the CQP 939 over the CQbus by the MRT update process 956, FIG. 6B running on CP 964 to indicate that the SQ entry formerly occupied by the MR OP is now available by which to queue other control ops such as another MR OP for another RDMA type memory transaction. Finally, as shown in Row 5, once this has been accomplished, the work scheduler 954 can reschedule the processing of the SEND op and processing is resumed until completed by the transmit pipeline processor 966.

Those of skill in the art will appreciate that the example of FIG. 7 illustrates only one possible sequence for the resolution of the operations running in parallel. The sequence shown in FIG. 7 serves to illustrate the scenario where the MR OP loses the race to the processing of the SEND OP and as such the SEND OP must be suspended until the memory registration process is complete. Also possible (and more likely) is that processing the MR OP is completed ahead of the SEND OP and thus no suspension of the transmit pipeline or rescheduling of the SEND OP would be necessary.

In the example of FIG. 8, a remote RDMA Write transaction is illustrated that requires communication with a remote node of the network. In this case, the process starts out in the time period delineated by Row 1 as it did in the example of FIG. 7, wherein a memory registration verb call is made by the RI running on the local server in connection with the RDMA Write. As a result of that call, a register write to the MRTE Management register is queued which establishes an MRTE entry for the memory region x to be registered and sets the pending bit in that entry. An MR OP is then posted to the SQ of the CQP of the local server representing the registration that must take place as previously discussed. Finally, a write to the WQE allocate register rings the doorbell of the Context Update and Work Scheduler 954 to notify it of the pending work (i.e. the MR OP) in the CQP. The server processor then returns from the verb call.

Once returned from the verb call, the RI is free to post a SEND OP on its QP_(N) that advertises to the remote application running on the remote server that the data will be sourced from memory region x using an STag=x. Those of skill in the art will recognize that the STag (also known as a Steering Tag) is the format defined by the iWARP specification for identifying memory regions. This posted SEND also requests an RDMA write operation. This is indicated in Row 2 of the pipelined sequence. This posted SEND also includes a write to the WQE allocate register to notify the Context Update and Work Scheduler 954, FIG. 6B of the pending work in the QP_(N). Before, simultaneously with or after the foregoing activities, the local adapter CP may begin processing the MR OP for memory region x. This involves setting up the MRTE, pulling the page list from host memory and setting up the physical buffer list in the adapter memory as previously described.

At some time in the future, the TX pipeline of the local adapter begins to process the SEND OP, but because this SEND OP does not require access to the memory region x, its processing does not need to be suspended notwithstanding that the MR OP has not yet completed. This step is indicated in Row 3 of the sequence. Sometime after, as indicated in Row 4, the remote node receives the SEND OP requesting the RDMA Write operation to the memory region x STag and this is posted on the SQ of the remote node's QP.

At some point in the future, as indicated in Row 5, the local server adapter's RX pipeline receives the RDMA write as a work request from the remote server, but because the memory region x is going to be the sink for this transaction, and because in this scenario the pending bit has yet to be cleared for memory region x because the MR OP has not been completed, the RX pipeline processing of this RDMA write work request is suspended until that happens. Finally, in Row 6, the MR OP has been completed and the pending bit has been cleared through mechanisms previously discussed, and thus the RDMA Write Op work request is resumed and completed to memory region x subsequently in Row 7.
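The asymmetry between the TX and RX pipelines in the FIG. 8 scenario can be illustrated with a small sketch. It assumes a memory region table modeled as a dictionary keyed by STag; the function names `tx_process_send`, `rx_process_rdma_write` and `cp_complete_mr_op` are hypothetical and do not correspond to named elements of the adapter.

```python
def tx_process_send(mrt: dict, stag: str) -> str:
    """Row 3: the SEND OP only advertises the STag, so the pending bit is not consulted."""
    return "sent"


def rx_process_rdma_write(mrt: dict, stag: str) -> str:
    """Rows 5 and 7: the inbound RDMA Write sinks into the region, so the bit is checked."""
    if mrt[stag]["pending"]:
        return "suspended"   # registration not finished; the work request must wait
    return "completed"       # region is registered; data can be placed


def cp_complete_mr_op(mrt: dict, stag: str) -> None:
    """Row 6: the control path finishes the MR OP and clears the pending bit."""
    mrt[stag]["pending"] = False


# Replay of the sequence in which the RDMA Write wins the race with the MR OP.
mrt = {"x": {"pending": True}}
row3 = tx_process_send(mrt, "x")        # proceeds despite the pending bit
row5 = rx_process_rdma_write(mrt, "x")  # suspended
cp_complete_mr_op(mrt, "x")
row7 = rx_process_rdma_write(mrt, "x")  # resumed and completed
```

If `cp_complete_mr_op` runs before the inbound write arrives, which the description treats as the more likely case, the Row 5 check passes on the first attempt and no suspension occurs.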

Those of skill in the art will appreciate that it is much more likely that the MR OP will have been completed while the servers are exchanging operations (i.e. Rows 3, 4 and 5) and that the completion of the RDMA transaction will not be held up. Moreover, it should be appreciated that the scenario illustrated in FIG. 8 is only one of a number of possible sequences, chosen to illustrate the scenario where the RDMA Write OP work request wins the race with the MR OP process and requires stalling of the adapter pipeline processing of the network transaction pending completion of the memory registration process for memory region x.

RDMA Read operations are similar to the RDMA Write operations as shown in FIG. 8 except that the remote node does not need to post an operation as the remote node simply performs the read operation.

FIG. 9 illustrates an embodiment of a network topology wherein clustering transactions, user transactions and high-performance storage access transactions may all be performed over a single fabric such as Ethernet 108. In an embodiment, the pipelining of the memory registration process occurs in the same manner as previously discussed. However, the embodiment of FIG. 9 has the additional advantage of requiring only one adapter and one fabric to process all network transaction types. A more detailed view of an embodiment of a protocol engine 901, FIG. 6B is illustrated in FIG. 10 that is capable of performing the pipelining of the memory registration process (and all management control operations) as previously described as well as to handle all three types of network transactions with one adapter.

FIG. 10 is a more detailed block diagram of an embodiment of the protocol engine 901, FIG. 6B of the adapter 650 of FIG. 6B. This embodiment of the protocol engine 901 is useful to achieve the system topology of FIG. 9 as it provides functionality necessary to handle user, clustering and high-performance storage access transactions over a single adapter and thus a single Ethernet network. As previously discussed, the adapter handles both traditional RDMA memory transaction work requests as well as management control operations including memory registration operations. The TX 966/RX 968 FIG. 6B pipeline includes various processing stages to handle the processing of the three types of network transactions, depending upon whether they are user transactions (e.g. TCP/IP packetized data over conventional sockets type connections), RDMA offloaded connections (iWARP connections for direct data placement), or high-performance storage transactions (such as in accordance with the iSCSI standard).

As shown in the block diagram of FIG. 6B, a protocol engine arbiter 958/960 is connected to the transaction switch 970 and the local memory interface 972 to provide a point of contact between the protocol engine 901 and those devices. Various subcomponents of the transaction pipeline of the protocol engine 901 have their access arbitrated to those two devices by the protocol engine arbiter 958/960. In basic operation a series of tasks are performed by the various modules or sub-modules in the protocol engine 901 to handle the various iWARP, iSCSI and regular Ethernet traffic (including the QP operations). A context manager 1015 has a dedicated datapath to the local memory interface. As each connection which is utilized by the adapter 650 must have a context, various subcomponents or submodules are connected to the context manager 1015 as indicated by the arrows captioned cm. The context manager 1015 contains a context cache 1014, which caches the context from the local adapter memory 654, FIG. 6B, and a work available memory region cache 1013, which contains memory used to transmit scheduling algorithms to determine which operations occur next in the protocol engine 901.

The schedules are effectively developed in a work queue manager (WQM) 1025. The WQM 1025 handles scheduling for all transmissions of transactions of all protocol types in the protocol engine 901. One of the main activities of the WQM 1025 is to determine when data needs to be retrieved from the adapter memory 654, FIG. 6B for operation by one of the various modules. The WQM 1025 handles this operation by requesting a time slice from the protocol engine arbiter 958/960 to allow the WQM 1025 to retrieve the desired information and place it in a work queue. A completion queue manager (CQM) 1050 acts to provide task completion indications to the CPUs 610. The CQM 1050 handles this task for various submodules with connections to those submodules indicated by arrows captioned by cqm. A doorbell submodule 1005 receives commands from the host, such as “a new work item has been posted to SQ x,” and converts these commands into the appropriate context updates.

A TCP off-load engine (TOE) 1035 includes submodules of transmit logic and receive logic to handle processing for accelerated TCP/IP connections. The receive logic parses the TCP/IP headers, checks for errors, validates the segment, processes received data, processes acknowledges, updates RTT estimates and updates congestion windows. The transmit logic builds the TCP/IP headers for outgoing packets, performs ARP table look-ups, and submits the packet to the transaction switch 970, FIG. 6B. An iWARP module 1030 includes a transmit logic portion and a receive logic portion. The iWARP module 1030 implements various layers of the iWARP specification, including the MPA, DDP and RDMAP layers. The receive logic accepts inbound RDMA messages from the TOE 1035 for processing. The transmit logic creates outbound RDMA segments from PCI data received from the host CPUs 610, FIG. 6A. A NIC module 1040 is present and connected to the appropriate items, such as the work queue manager 1025 and the protocol engine arbiter 958/960. An optional iSCSI module 1045 is present to provide hardware acceleration to the iSCSI protocol as necessary.

Typically the host operating system provides the adapter 650 with a set of restrictions defining which user-level software processes are allowed to use which host memory address ranges in work requests posted to the adapter 650. Enforcement of these restrictions is handled by an accelerated memory protection (AMP) module 1028. The AMP module 1028 validates the iWARP STags using the memory region table (MRT) 980, FIG. 6B and returns the associated physical buffer list (PBL) information. An HDMA block 1031 is provided to handle the DMA transfer of information between host memory 620, via the bus 950, and the transaction switch 970 on behalf of the WQM 1025 or the iWARP module 1030. An ARP module 1032 is provided to retrieve MAC destination addresses from the memory. A free list manager (FLM) 1034 is provided to work with various other modules to determine the various memory blocks which are available.
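The validation role described for the AMP module can be sketched as a lookup that either returns the PBL or rejects the access. The sketch below is an assumption-laden software model: the entry layout (`owner_pid`, `base`, `length`, `pbl`), the `AccessDenied` exception and the function `amp_validate` are hypothetical and are not taken from the description, which does not specify how the restrictions are encoded.

```python
from dataclasses import dataclass


@dataclass
class MRTEntry:
    """Hypothetical memory region table entry; field names are illustrative."""
    owner_pid: int   # process permitted to use this region
    base: int        # first host address covered by the region
    length: int      # size of the region in bytes
    pbl: list        # physical buffer list (page addresses)


class AccessDenied(Exception):
    """Raised when a work request violates the host-supplied restrictions."""


def amp_validate(mrt: dict, stag: int, pid: int, addr: int, nbytes: int) -> list:
    """Validate an STag for the requesting process and range; return the PBL."""
    entry = mrt.get(stag)
    if entry is None:
        raise AccessDenied("unknown STag")
    if entry.owner_pid != pid:
        raise AccessDenied("process may not use this region")
    if addr < entry.base or addr + nbytes > entry.base + entry.length:
        raise AccessDenied("address range falls outside the region")
    return entry.pbl
```

A pending-bit check, as discussed for the transaction pipeline, would sit alongside these tests; it is omitted here to keep the sketch focused on the protection check.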

As previously discussed, when work has been placed on a QP or a CQP, a doorbell is rung to inform the protocol engine 901 that work has been placed in those queues that must be performed. Doorbell 1005 is provided to form an interface between the host CPU 610, FIG. 6A and the protocol engine 901 to allow commands to be received and status to be returned. The protocol engine 901 of the preferred embodiment also contains a series of processors to perform required operations. As previously discussed, one of those processors is a control queue processor (CP) 964 that handles management control operations such as the memory registration operations. In this way, the control operations such as the MR OPs are given their own pipeline in which to be processed in parallel with the QP transmit 966/receive 968 pipeline formed by the components discussed above for processing QP work requests. The control queue processor CP 964 performs commands submitted by the various host drivers via control queue pairs CQPs 939, FIG. 6A as previously outlined above.

The CP 964 has the capability to initialize and destroy QPs and memory windows and regions. As previously discussed, while processing RDMA QP transactions, the iWARP module 1030 and other QP transaction pipeline components monitor the registration status of the memory regions as maintained in the MRT in the adapter memory and will stall any QP work requests referencing memory regions for which registration has not yet completed (i.e. for which the pending bit is still set). Stalled QP work requests can be rescheduled in any manner known to those of skill in the art. Each rescheduled QP work request will be permitted to complete when a check shows that the pending bit for its referenced memory region has been cleared.
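Because the description leaves the rescheduling policy open, the following sketch shows just one simple possibility: a stalled work request is moved to the tail of the queue and retried later. The function `run_pipeline`, the tick-based timing and the `mr_done_at` schedule are hypothetical constructs used only to make the example self-contained.

```python
from collections import deque


def run_pipeline(work_requests, pending, mr_done_at):
    """Return work request names in completion order.

    work_requests: list of (name, stag) pairs in posting order.
    pending:       set of STags whose pending bit is currently set.
    mr_done_at:    dict mapping STag -> tick at which the CP clears its pending bit.
    """
    pending = set(pending)
    missing = pending - set(mr_done_at)
    if missing:
        # Without a completion time the stalled request would be retried forever.
        raise ValueError(f"no MR OP completion scheduled for {sorted(missing)}")

    queue = deque(work_requests)
    completed = []
    tick = 0
    while queue:
        # Control path: clear the pending bit of every MR OP finished by this tick.
        for stag, done_tick in mr_done_at.items():
            if done_tick <= tick:
                pending.discard(stag)
        # Transaction path: stall and reschedule, or complete.
        name, stag = queue.popleft()
        if stag in pending:
            queue.append((name, stag))   # reschedule behind other work
        else:
            completed.append(name)
        tick += 1
    return completed


# wr_a references region "x", whose registration finishes at tick 2; wr_b does not wait.
order = run_pipeline([("wr_a", "x"), ("wr_b", "y")], {"x"}, {"x": 2})
```

Here `order` is `["wr_b", "wr_a"]`: the request on the unregistered region yields to other work and completes once its pending bit has been cleared.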

A second processor is the out-of-order processor (OOP) 1041. The out-of-order processor 1041 is used to handle the problem of TCP/IP packets being received out-of-order and is responsible for determining and tracking the holes and properly placing new segments as they are obtained. A transmit error processor (TEP) 1042 is provided for exception handling and error handling for the TCP/IP and iWARP protocols. The final processor is an MPA reassembly processor 1044. This processor 1044 is responsible for managing the receive window buffer for iWARP and processing packets that have MPA FPDU alignment or ordering issues.
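The hole tracking attributed to the out-of-order processor can be illustrated with a generic interval-merging sketch. This is not the OOP 1041 algorithm, which the description does not detail; `add_segment` and `holes` are hypothetical helpers that model received data as half-open byte ranges `[start, end)` in sequence-number space, ignoring sequence-number wraparound.

```python
def add_segment(received, start, end):
    """Insert the range [start, end) and merge it with overlapping or adjacent ranges."""
    merged = []
    for s, e in sorted(received + [(start, end)]):
        if merged and s <= merged[-1][1]:
            # Overlaps or touches the previous range: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged


def holes(received, next_expected):
    """List the gaps between next_expected and the end of the last received range."""
    gaps = []
    cursor = next_expected
    for s, e in received:   # received is kept sorted by add_segment
        if s > cursor:
            gaps.append((cursor, s))
        cursor = max(cursor, e)
    return gaps


# Two segments arrive ahead of the data expected at sequence number 0.
ranges = add_segment([], 100, 200)
ranges = add_segment(ranges, 300, 400)
gaps_before = holes(ranges, 0)          # [(0, 100), (200, 300)]
ranges = add_segment(ranges, 200, 300)  # fills the middle hole
gaps_after = holes(ranges, 0)           # [(0, 100)]
```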

Embodiments of the present invention have been disclosed herein that provide a pipeline for handling management control operations such as memory registration that is independent of the one that handles QP work requests generated for RDMA type memory transactions. In embodiments of the invention, the queue pair paradigm is leveraged to make integration of the control pipeline with the QP work request pipeline more straightforward. The QP work request pipeline monitors the completion of pending memory registration operations for each memory region, and stalls the processing of any QP transactions using memory regions for which registration has not completed. Because most of the control operations will complete before the processing of their associated QP work requests completes, the latency that is typically associated with the control operations such as memory registration is hidden and throughput of the network is increased. Because the processing of those QP work requests that do win the race may be suspended and rescheduled, the serial nature of the registration process is still maintained per existing RDMA standards, and the mechanism is hidden from the applications running on the servers in a network such as a server cluster.

It will be understood from the foregoing description that modifications and changes may be made in various embodiments of the present invention without departing from its true spirit. The descriptions in this specification are for purposes of illustration only and are not to be construed in a limiting sense. The scope of the present invention is limited only by the language of the following claims.

1. A method of pipelining the processing of RDMA type network transactions, said method comprising: calling a control operation in response to an RDMA type transaction request; queuing the control operation in a first pipeline for processing control operations and setting a pending bit associated with the control operation; queuing a work request in a second pipeline for processing work requests, the work request generated in accordance with the RDMA type transaction; and processing the control operation in the first pipeline and the work request in the second pipeline, said processing further comprising: resetting the pending bit when said processing of the control operation is complete; if said processing of the work request is to commence and the pending bit is still set, delaying processing of the work request; and completing said processing of the work request whenever the pending bit is reset.
2. The method of claim 1 wherein said queuing the control operation further comprises posting the control operation in a control queue pair.
3. The method of claim 2 wherein said setting a pending bit further comprises performing a register write operation to a table entry associated with the control operation.
4. The method of claim 3 wherein said queuing the control operation comprises ringing a doorbell to notify the first pipeline that the control operation has been posted.
5. The method of claim 1 wherein the work request is a memory registration operation for a memory region specified by the RDMA type transaction, and processing the work request requires access to the registered memory region.
6. The method of claim 1 wherein delaying processing of the work request comprises rescheduling the work request in the second pipeline.
7. The method of claim 1 wherein the second pipeline is a transmit pipeline.
8. The method of claim 1 wherein the second pipeline is a receive pipeline, the method further comprising processing the work request in a third pipeline without referencing the pending bit, the third pipeline being a transmit pipeline.
9. A method of pipelining the processing of RDMA type network transactions, said method comprising: calling an RDMA memory registration verb in response to an RDMA type transaction request, said calling further comprising: queuing a memory registration operation in a first pipeline for processing control operations; and setting a pending bit associated with the memory registration operation; queuing a work request in a second pipeline for processing work requests, the work request generated in accordance with the RDMA type transaction; and processing the memory registration operation in the first pipeline and the work request in the second pipeline, said processing further comprising: resetting the pending bit when said processing of the memory registration operation is complete; if said processing of the work request is to commence and the pending bit is still set, delaying processing of the work request; and completing said processing of the work request whenever the pending bit is reset.
10. The method of claim 9 wherein said queuing a memory registration operation further comprises: posting the memory registration operation on a send queue of a control queue pair; and ringing a doorbell to notify the first pipeline that the memory registration operation is queued for processing.
11. The method of claim 10 wherein said queuing a work request further comprises: posting the work request to a queue pair; and ringing a doorbell to notify the second pipeline that the work request is queued for processing.
12. The method of claim 11 wherein said ringing a doorbell to notify the second pipeline further comprises writing a work queue allocate register with a work queue element that represents the work request.
13. The method of claim 9 wherein said setting and said resetting the pending bit further comprises writing a memory region table update register coupled to a memory region table comprising an entry associated with the memory registration operation.
14. The method of claim 9 wherein delaying processing of the work request comprises rescheduling the work request in the second pipeline.
15. The method of claim 9 wherein the second pipeline is a transmit pipeline.
16. The method of claim 9 wherein the second pipeline is a receive pipeline, the method further comprising processing the work request in a third pipeline without referencing the pending bit, the third pipeline being a transmit pipeline.
17. An apparatus for pipelining the processing of RDMA type network transactions, said apparatus comprising: means for calling a control operation in response to an RDMA type transaction request; means for queuing the control operation in a first pipeline for processing control operations and means for setting a pending bit associated with the control operation; means for queuing a work request in a second pipeline for processing work requests, the work request generated in accordance with the RDMA type transaction; and means for processing the control operation in the first pipeline and the work request in the second pipeline, said means for processing further comprising: means for resetting the pending bit when said processing of the control operation is complete; means for delaying said processing of the work request whenever the pending bit is still set and processing of the work request is to commence; and means for completing said processing of the work request whenever the pending bit is reset.
18. The apparatus of claim 17 wherein said means for queuing the control operation further comprises means for posting the control operation in a control queue pair.
19. The apparatus of claim 18 wherein said means for setting a pending bit further comprises means for performing a register write operation to a table entry associated with the control operation.
20. The apparatus of claim 19 wherein said means for queuing the control operation comprises means for ringing a doorbell to notify the first pipeline that the control operation has been posted.
21. The apparatus of claim 17 wherein delaying processing of the work request comprises rescheduling the work request in the second pipeline.
22. The apparatus of claim 17 wherein the second pipeline is a transmit pipeline.
23. The apparatus of claim 17 wherein the second pipeline is a receive pipeline, the apparatus further comprising means for processing the work request in a third pipeline without referencing the pending bit, the third pipeline being a transmit pipeline.
24. An apparatus for pipelining the processing of RDMA type network transactions, said apparatus comprising: means for calling an RDMA memory registration verb in response to an RDMA type transaction request, said means for calling further comprising: means for queuing a memory registration operation in a first pipeline for processing control operations; and means for setting a pending bit associated with the memory registration operation; means for queuing a work request in a second pipeline for processing work requests, the work request generated in accordance with the RDMA type transaction; and means for processing the memory registration operation in the first pipeline and the work request in the second pipeline, said means for processing further comprising: means for resetting the pending bit when said processing of the memory registration operation is complete; means for delaying said processing of the work request whenever the pending bit is still set and processing of the work request is to commence; and means for completing said processing of the work request whenever the pending bit is reset.
25. The apparatus of claim 24 wherein said means for queuing a memory registration operation further comprises: means for posting the memory registration operation on a send queue of a control queue pair; and means for ringing a doorbell to notify the first pipeline that the memory registration operation is queued for processing.
26. The apparatus of claim 25 wherein said means for queuing a work request further comprises: means for posting the work request to a queue pair; and means for ringing a doorbell to notify the second pipeline that the work request is queued for processing.
27. The apparatus of claim 26 wherein said means for ringing a doorbell to notify the second pipeline further comprises writing a work queue allocate register with a work queue element that represents the work request.
28. The apparatus of claim 24 wherein delaying processing of the work request comprises rescheduling the work request in the second pipeline.
29. The apparatus of claim 24 wherein the second pipeline is a transmit pipeline.
30. The apparatus of claim 24 wherein the second pipeline is a receive pipeline, the apparatus further comprising means for processing the work request in a third pipeline without referencing the pending bit, the third pipeline being a transmit pipeline.
31. A computer system for connection to a network for pipelining the processing of RDMA type network transactions, said computer system comprising: a processor; memory coupled to said processor and including a plurality of memory regions; an adapter coupled to said processor and said memory and for connection to the network, said adapter comprising: a control operation portion having a first pipeline for processing control operations; a work request portion having a second pipeline for processing work requests; and a memory region table; and a program executing on said processor, said program: queuing a control operation of an RDMA transaction request in said first pipeline and setting a pending bit in said memory region table associated with said control operation and the relevant memory region of said plurality of memory regions; and queuing a work request of said RDMA transaction request for said second pipeline, wherein said control operation portion resets said pending bit when processing of said control operation is complete, and wherein said work request portion delays processing of said work request whenever said pending bit is still set and processing of said work request is to commence; and wherein said work request portion completes processing of said work request whenever the pending bit is reset.
32. The computer system of claim 31 wherein said memory includes a control queue pair and wherein queuing said control operation further comprises posting said control operation in said control queue pair.
33. The computer system of claim 32 wherein said adapter further comprises a register to receive entries for said memory region table and wherein setting said pending bit further comprises performing a register write operation to said register.
34. The computer system of claim 33 wherein said adapter further comprises a doorbell to receive notification that said control operation has been posted and wherein queuing said control operation comprises ringing said doorbell.
35. The computer system of claim 31 wherein delaying processing of said work request comprises rescheduling said work request in said second pipeline.
36. The computer system of claim 31 wherein said second pipeline is a transmit pipeline.
37. The computer system of claim 31 wherein said second pipeline is a receive pipeline, and wherein said work request portion processes the work request in a third pipeline without referencing the pending bit, said third pipeline being a transmit pipeline.
38. A computer system for connection to a network for pipelining the processing of RDMA type network transactions, said computer system comprising: a processor; memory coupled to said processor and including a plurality of memory regions; an adapter coupled to said processor and said memory and for connection to the network, said adapter comprising: a control portion having a first pipeline for processing control operations, said control operations including memory registration; a work request portion having a second pipeline for processing work requests; and a memory region table; and a program executing on said processor, said program: queuing a memory registration operation of an RDMA transaction request in said first pipeline and setting a pending bit in said memory region table associated with said memory registration operation and the relevant memory region of said plurality of memory regions; and queuing a work request of said RDMA transaction request for said second pipeline, wherein said control portion resets said pending bit when processing of said memory registration operation is complete, and wherein said work request portion delays processing of said work request whenever said pending bit is still set and processing of said work request is to commence; and wherein said work request portion completes processing of said work request whenever the pending bit is reset.
39. The computer system of claim 38 wherein said memory includes a control queue pair, said adapter further comprises a control operation doorbell to receive notification that a control operation has been posted and wherein queuing said memory registration operation further comprises: posting said memory registration operation on a send queue of said control queue pair; and ringing said control operation doorbell.
40. The computer system of claim 39 wherein said memory includes a work request queue pair, wherein said adapter further comprises a work request doorbell to receive notification that a work request has been posted and wherein queuing said work request further comprises: posting said work request to said work request queue pair; and ringing said work request doorbell.
41. The computer system of claim 40 wherein said adapter further comprises a work queue allocate register and wherein queuing said work request further comprises writing said work queue allocate register with a work queue element that represents said work request.
42. The computer system of claim 38 wherein delaying processing of said work request comprises rescheduling said work request in said second pipeline.
43. The computer system of claim 38 wherein said second pipeline is a transmit pipeline.
44. The computer system of claim 38 wherein said second pipeline is a receive pipeline, and wherein said work request portion processes the work request in a third pipeline without referencing the pending bit, said third pipeline being a transmit pipeline.
45. A method of pipelining the processing of RDMA type network transactions, said method comprising: calling a control operation in response to an RDMA type transaction request; queuing the control operation in a first pipeline for processing control operations and setting a pending bit associated with the control operation; queuing a work request in a second pipeline for processing work requests, the work request generated in accordance with the RDMA type transaction, the work request generating an RDMA operation from a remote node and the RDMA operation from the remote node generating RDMA work processed by a third pipeline; and processing the control operation in the first pipeline and the RDMA work in the third pipeline, said processing further comprising: resetting the pending bit when said processing of the control operation is complete; if said processing of the RDMA work is to commence and the pending bit is still set, delaying processing of the RDMA work; and completing said processing of the RDMA work whenever the pending bit is reset.
46. The method of claim 45 wherein said processing further comprises processing the work request in the second pipeline without referencing the pending bit.
47. The method of claim 46 wherein the second pipeline is a transmit pipeline and the third pipeline is a receive pipeline.
48. An apparatus for pipelining the processing of RDMA type network transactions, said apparatus comprising: means for calling a control operation in response to an RDMA type transaction request; means for queuing the control operation in a first pipeline for processing control operations and setting a pending bit associated with the control operation; means for queuing a work request in a second pipeline for processing work requests, the work request generated in accordance with the RDMA type transaction, the work request generating an RDMA operation from a remote node and the RDMA operation from the remote node generating RDMA work processed by a third pipeline; and means for processing the control operation in the first pipeline and the RDMA work in the third pipeline, said means for processing further comprising: means for resetting the pending bit when said processing of the control operation is complete; means for delaying processing of the RDMA work if said processing of the RDMA work is to commence and the pending bit is still set; and means for completing said processing of the RDMA work whenever the pending bit is reset.
49. The apparatus of claim 48 wherein said means for processing further comprises means for processing the work request in the second pipeline without referencing the pending bit.
50. The apparatus of claim 49 wherein the second pipeline is a transmit pipeline and the third pipeline is a receive pipeline.
51. A computer system for connection to a network for pipelining the processing of RDMA type network transactions, said computer system comprising: a processor; memory coupled to said processor and including a plurality of memory regions; an adapter coupled to said processor and said memory and for connection to the network, said adapter comprising: a control operation portion having a first pipeline for processing control operations; a work request portion having a second pipeline for processing work requests; an RDMA work portion having a third pipeline for processing RDMA work; and a memory region table; and a program executing on said processor, said program: queuing a control operation of an RDMA transaction request in said first pipeline and setting a pending bit in said memory region table associated with said control operation and the relevant memory region of said plurality of memory regions; and queuing a work request of said RDMA transaction request for said second pipeline, said work request generated in accordance with the RDMA type transaction, said work request generating an RDMA operation from a remote node and said RDMA operation from the remote node generating RDMA work processed by said third pipeline; wherein said control operation portion resets said pending bit when processing of said control operation is complete, and wherein said RDMA work portion delays processing of said RDMA work whenever said pending bit is still set and processing of said RDMA work is to commence; and wherein said RDMA work portion completes processing of said RDMA work whenever said pending bit is reset.
52. The computer system of claim 51 wherein said work request portion further processes said work request in said second pipeline without referencing the pending bit.
53. The computer system of claim 52 wherein said second pipeline is a transmit pipeline and said third pipeline is a receive pipeline.